Loading in the data

ballot_measure_poll <- read.csv("~/Downloads/Exercise/ballot_measure_poll.csv")
individual_demographics_and_scores <- read.csv("~/Downloads/Exercise/individual_demographics_and_scores.csv")
precinct_level_election_results <- read.csv("~/Downloads/Exercise/precinct_level_election_results.csv")

Examining Structure of the data

str(ballot_measure_poll)
## 'data.frame':    1099 obs. of  29 variables:
##  $ voter_id                           : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ support_initiative                 : chr  "yes" "yes" "yes" "no" ...
##  $ region                             : chr  "west" "west" "east" "east" ...
##  $ county                             : chr  "duchess" "duchess" "llandilo" "cuya" ...
##  $ education                          : chr  "college_graduate" "college_graduate" "college_graduate" "college_graduate" ...
##  $ ses                                : chr  "mid_ses" "low_ses" "wealthy" "wealthy" ...
##  $ ethnicity                          : chr  "race_A" "race_A" "race_B" "race_A" ...
##  $ ideology                           : chr  "moderate" "liberal" "liberal" "conservative" ...
##  $ kids                               : chr  "yes" "no" "yes" "yes" ...
##  $ pro_authoritarianism_score         : int  69 65 73 77 76 76 81 75 70 85 ...
##  $ pro_taxes_score                    : int  50 43 34 34 34 42 39 36 42 36 ...
##  $ pro_gunrights_score                : int  39 56 44 60 69 42 42 68 46 58 ...
##  $ pro_healthcare_score               : int  62 58 35 28 32 31 33 40 23 27 ...
##  $ pro_immigrants_score               : int  50 58 50 32 26 26 24 37 29 20 ...
##  $ pro_supporting_the_poor_score      : int  51 49 38 26 35 37 40 34 33 39 ...
##  $ environmentalist_score             : int  49 51 54 53 53 53 55 53 53 57 ...
##  $ trust_in_institutions_score        : int  46 46 46 46 46 46 46 46 46 46 ...
##  $ economic_populist_score            : int  54 45 40 23 29 31 27 31 21 32 ...
##  $ pro_military_score                 : int  40 42 56 57 59 60 61 52 55 72 ...
##  $ pro_regulation_score               : int  54 51 60 56 51 62 62 49 63 58 ...
##  $ traditionalist_score               : int  38 35 52 49 47 55 55 46 50 62 ...
##  $ compassionate_score                : int  66 66 50 43 50 52 50 47 52 43 ...
##  $ pro_free_trade_score               : int  42 49 48 46 40 30 33 48 33 37 ...
##  $ pro_globalism_score                : int  49 54 49 40 42 38 39 47 34 39 ...
##  $ pro_healthcare_women_score         : int  71 61 51 41 43 44 44 53 42 41 ...
##  $ pro_populism_score                 : int  31 38 37 30 37 30 29 36 25 37 ...
##  $ presidential_election_turnout_score: int  15 NA 13 71 76 25 72 31 31 72 ...
##  $ racial_resentment_score            : int  55 49 67 77 75 78 77 69 78 83 ...
##  $ pro_religious_freedom_score        : int  36 41 57 52 55 50 52 50 50 59 ...
str(individual_demographics_and_scores) # This is huge
## 'data.frame':    3421697 obs. of  15 variables:
##  $ precinct                                 : chr  "precinct__1287" "precinct__357" "precinct__347" "precinct__1318" ...
##  $ probability_race_G                       : num  3 1 1 59 96 49 98 94 98 97 ...
##  $ probability_race_P                       : num  0 0 0 31 2 48 1 2 1 0 ...
##  $ probability_race_O                       : num  97 99 99 2 1 3 0 2 0 0 ...
##  $ gender                                   : int  0 1 1 0 1 0 1 0 0 0 ...
##  $ age                                      : num  50 47 90 54 80 51 72 41 48 50 ...
##  $ partisan_score                           : num  40 98 99 10 27 55 10 98 0 91 ...
##  $ turnout_score                            : num  52 74 60 12 94 83 96 97 2 97 ...
##  $ probability_highest_education_high_school: num  44 64 80 71 74 45 13 23 16 15 ...
##  $ support_tax_on_wealthy_score             : num  65 80 69 79 79 81 58 87 27 83 ...
##  $ support_progressive_taxation_score       : num  12 95 31 8 86 54 47 95 9 62 ...
##  $ support_cannabis_legalization_score      : num  23 44 49 34 43 68 18 85 22 49 ...
##  $ probability_income_over_100k             : num  72 20 10 36 52 45 85 46 93 90 ...
##  $ probability_children_in_household        : num  89 84 18 52 12 80 15 89 83 60 ...
##  $ support_trump_score                      : num  77 4 34 92 14 54 29 1 95 28 ...
str(precinct_level_election_results)
## 'data.frame':    1128 obs. of  11 variables:
##  $ votes_for_candidate_I       : num  1286 728 1287 1837 169 ...
##  $ votes_for_candidate_U       : num  2974 1120 480 1833 409 ...
##  $ county                      : chr  "county__2" "county__1" "county__1" "county__3" ...
##  $ precinct                    : chr  "precinct__3" "precinct__4" "precinct__5" "precinct__7" ...
##  $ population                  : int  5164 2643 2856 4368 701 809 3995 3742 1179 3981 ...
##  $ votes_for_president         : num  4313 1892 1804 3702 584 ...
##  $ registered_voters           : num  4747 2351 2531 4052 670 ...
##  $ all_ballot_measure_votes    : num  4023 1839 1739 3523 576 ...
##  $ total_ballots               : num  4361 1899 1824 3744 586 ...
##  $ votes_against_ballot_measure: num  2216 861 500 1767 277 ...
##  $ votes_for_ballot_measure    : num  1807 978 1239 1756 299 ...

Our partners want to understand what happened on election day, please explore the precinct-level data. Precinct-level results for this ballot measure are provided in precinct_level_election_results.csv. Individual-level demographic features for voters registered to vote in these precincts is provided in individual_demographics_and_scores.csv. Please conduct a retrospective analysis of this election. - What factors do you think relate to support for this ballot measure? (use the precinct level aggregates found in precinct_votes_for_ballot_measure and precinct_votes_against_ballot_measure) - What factors do you think relate to turning out to vote in this election? (precinct_total_ballots tells you the number of ballots cast by voters in each precinct, while precinct_registered_voters gives the number of registered voters in the precinct). - What information do you think would aid our partner in voter outreach in a subsequent, similar, election?

Things notice right off the bat.

So for the individual demographics and scores there’s alot of people. I would probably just condense it down. I would probably do something more advance but I would get the average, sd, and also count for each one of these. Average would probably be a better one.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
individual_demographics_and_scores2 <- individual_demographics_and_scores %>% group_by(precinct) %>% summarise(
  g_race = mean(probability_race_G, na.rm = TRUE), 
  p_race = mean(probability_race_P, na.rm = TRUE), 
  o_race = mean(probability_race_O, na.rm = TRUE),
  gender = mean(gender, na.rm = TRUE),
  age = mean(age, na.rm = TRUE),
  partison_score = mean(partisan_score, na.rm = TRUE),
  turnout_score = mean(turnout_score, na.rm = TRUE),
  prob_highest_education_high_school = mean(probability_highest_education_high_school, na.rm = TRUE),
  support_tax_on_wealthy_score = mean(support_tax_on_wealthy_score, na.rm = TRUE),
  support_cannabis_legalization_score = mean(support_cannabis_legalization_score, na.rm = TRUE),
  income_over_100k = mean(probability_income_over_100k, na.rm = TRUE),
  children_in_household = mean(probability_children_in_household, na.rm = TRUE),
  support_trump_score = mean(support_trump_score, na.rm = TRUE)
  
  # Created but not used
  #sd_race_g = sd(probability_race_G, na.rm = TRUE), 
  #sd_race_p = sd(probability_race_P, na.rm = TRUE), 
  #sd_race_O = sd(probability_race_O, na.rm = TRUE),
  #sd_gender = sd(gender, na.rm = TRUE),
  #sd_age = sd(age, na.rm = TRUE),
  #sd_partisan_score = sd(partisan_score, na.rm = TRUE),
  #sd_turnout_score = sd(turnout_score, na.rm = TRUE),
  #sd_probability_highest_education_high_school = sd(probability_highest_education_high_school, na.rm = TRUE),
  #sd_support_tax_on_wealthy_score = sd(support_tax_on_wealthy_score, na.rm = TRUE),
  #sd_support_cannabis_legalization_score = sd(support_cannabis_legalization_score, na.rm = TRUE),
  #sd_probability_income_over_100k = sd(probability_income_over_100k, na.rm = TRUE),
  #sd_probability_children_in_household = sd(probability_children_in_household, na.rm = TRUE),
  #sd_support_trump_score = sd(support_trump_score, na.rm = TRUE)
)

# removing to reduce storage
rm(individual_demographics_and_scores)
  
# joining datasets 
bothjoined <- merge( individual_demographics_and_scores2, precinct_level_election_results, by.x = "precinct", by.y = "precinct") %>% mutate(against_percent = votes_against_ballot_measure/all_ballot_measure_votes,
                                for_percent = votes_for_ballot_measure/all_ballot_measure_votes,                                                             percent_registered_vote = registered_voters/population,
                                percent_of_registered = all_ballot_measure_votes/total_ballots
                                )

Visualizations

I am curious now in terms of visualization, the best way to do this is through pairs plots and correlations.

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
my_fn <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) + 
    geom_point() + 
    geom_smooth(method=lm, fill="blue", color="blue", ...)
  p
}


ggpairs(bothjoined %>% select(g_race:age, for_percent:percent_of_registered, turnout_score), lower = list(continuous = my_fn), title="Race, Gender, Age") 
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

ggpairs(bothjoined %>% select(partison_score:support_cannabis_legalization_score, for_percent:percent_of_registered, turnout_score),lower = list(continuous = my_fn),  title="Partison, Turnout, Education, Tax, Marijuana") 
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

ggpairs(bothjoined %>% select(income_over_100k:support_trump_score, for_percent:percent_of_registered, turnout_score),lower = list(continuous = my_fn), title="Income, Children, Trump") 
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ purrr   1.0.1
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(corrr)
library(igraph)
## 
## Attaching package: 'igraph'
## 
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## 
## The following object is masked from 'package:tidyr':
## 
##     crossing
## 
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## 
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## 
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## 
## The following object is masked from 'package:base':
## 
##     union
library(ggraph)

tidy_cors <- bothjoined %>% select(g_race:support_trump_score,for_percent, percent_registered_vote) %>% 
  correlate() %>% 
  stretch()
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
graph_cors <- tidy_cors %>%
  filter(abs(r) > .4) %>%
  graph_from_data_frame(directed = FALSE)

ggraph(graph_cors) +
  geom_edge_link(aes(edge_alpha = abs(r), edge_width = abs(r), color = r)) +
  guides(edge_alpha = "none", edge_width = "none") +
  scale_edge_colour_gradientn(limits = c(-1, 1), colors = c("firebrick2", "dodgerblue2")) +
  geom_node_point(color = "white", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_graph() 
## Using "stress" as default layout

What factors do you think relate to support for this ballot measure?

We can address this by looking at the plot above and seeing which tie in to the for_percent the most.
Using a correlation cut off of .4 or greater for above.

What we care about is for the ballot measure. - positive correlation with Partison score and legalization (probably a leftist ballot) and support tax on wealthy score - negative correlation with trump and age.

What factors do you think relate to turning out to vote in this election?

For this one, need to get the correct definition of voter turnout. This can be precinct_total_ballots/precinct_registered_voters or precinct_registered_voters/population or this can be interpretted as voting. For argument sake using percent_registered to vote, which created above. The TurnoutScore provided by another provider actually is the strongest one to correlate to this. Examining the above one can see the following.

This is correlated to turnout score - towards the education (more educated more likely to vote) - more towards the income (higher income more likely to vote ) - older more likely to vote (age) - republicans more likely to vote.

Turnout score highly correlated with percent_registered_vote

If you look at the random forest below, you can see that the age, education, children in household, and income are the biggest factors. These all make sense because age and education from above, educated and active, while children in household you care more about state of affairrs and also income have bigger money on line, thus want to be more involved.

What information do you think would aid our partner in voter outreach in a subsequent, similar, election?

Know where to put focus and emphasis on. You would want to spend your political ads on people in the middle (to shift them left to your ad) and slightly to people on left more as an awareness and small activation. So the further left they are the more they are likely to support the iniative, so the less you need to advertise to them. The more center they are the more you need to advertise. So the data here can be used to identify target demographic.

For example, you probably dont want to do older and also more conservative people. You want to advertise towards the middle and left. So for example, here its seen that older people are more likely to vote and also agains the ballot. It would be really imperative to target the young people who would like this iniative and who do not go out to vote. This would be a swing. I would probably also target those that have a lower education because it does look like they have a small correlation and a negative correlation to coming out to target.

Next Steps

Running a regression on the variables of interest

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
## 
##     some
## The following object is masked from 'package:dplyr':
## 
##     recode
m0 <- lm(for_percent ~ partison_score + support_trump_score + age , data = bothjoined)
summary(m0) # So the reason why the coefficients don't make sense here is because of multicolinearity. Partison score and support trump score are highly correlated impacting the coefficient terms. So removing it (there are lots of ways to control for it, PCA, partial regression, decision tree instead, but thought just removing for now works)
## 
## Call:
## lm(formula = for_percent ~ partison_score + support_trump_score + 
##     age, data = bothjoined)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.305379 -0.035444  0.002662  0.038229  0.282578 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.3254032  0.0390801  33.915   <2e-16 ***
## partison_score      -0.0028931  0.0003024  -9.566   <2e-16 ***
## support_trump_score -0.0066246  0.0004382 -15.119   <2e-16 ***
## age                 -0.0046404  0.0003519 -13.187   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07216 on 1124 degrees of freedom
## Multiple R-squared:  0.4703, Adjusted R-squared:  0.4688 
## F-statistic: 332.6 on 3 and 1124 DF,  p-value: < 2.2e-16
vif(m0)
##      partison_score support_trump_score                 age 
##            8.278442            8.629809            1.419890
#### Doing a bunch of permutations with the different variables, it was found that the best variables that kept the coefficients interpreatble, while keeping R square highest was the ones below

m1 <- lm(for_percent ~ support_tax_on_wealthy_score +  age +prob_highest_education_high_school, data = bothjoined)
summary(m1) 
## 
## Call:
## lm(formula = for_percent ~ support_tax_on_wealthy_score + age + 
##     prob_highest_education_high_school, data = bothjoined)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.268586 -0.030921 -0.000468  0.028365  0.315107 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         0.4302965  0.0285656   15.06   <2e-16 ***
## support_tax_on_wealthy_score        0.0083642  0.0003119   26.82   <2e-16 ***
## age                                -0.0039796  0.0002963  -13.43   <2e-16 ***
## prob_highest_education_high_school -0.0032773  0.0001581  -20.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06288 on 1124 degrees of freedom
## Multiple R-squared:  0.5976, Adjusted R-squared:  0.5966 
## F-statistic: 556.5 on 3 and 1124 DF,  p-value: < 2.2e-16
vif(m1)
##       support_tax_on_wealthy_score                                age 
##                            1.67446                            1.32569 
## prob_highest_education_high_school 
##                            1.30639

For interpret ability and action, the regression above achieves the highest r square while keeping variables coefficients somewhat accurate.So the problem with where the regression fails is that there are lot of variables that are correlated and you need to check assumption. You can do multicolinearity controls or you can use more advance models to tell the impact the variables have

I used a random forest as a cross reference check, because a random forest is non parametric. There’s non linear component to it something that measures economic affinity or something This and PCA or multifactor modeling would help tease some things out btu this would be side research. KISS - keep it super ismple.

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(varImp)
## Loading required package: measures
## Loading required package: party
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## 
## Attaching package: 'modeltools'
## The following object is masked from 'package:car':
## 
##     Predict
## The following object is masked from 'package:igraph':
## 
##     clusters
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## 
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
## 
##     boundary
model <- randomForest(for_percent ~ partison_score + support_trump_score + age + support_cannabis_legalization_score + income_over_100k + gender + turnout_score + prob_highest_education_high_school + support_tax_on_wealthy_score + children_in_household, data = bothjoined, importance=TRUE) 
importance(model)
##                                      %IncMSE IncNodePurity
## partison_score                      20.16037     1.4898947
## support_trump_score                 20.39430     2.0651338
## age                                 21.99424     0.9117849
## support_cannabis_legalization_score 23.05373     1.9483734
## income_over_100k                    26.46150     0.9902511
## gender                              22.03205     0.5026855
## turnout_score                       20.69239     0.4553624
## prob_highest_education_high_school  26.49490     0.9300000
## support_tax_on_wealthy_score        23.20174     1.1347139
## children_in_household               23.36897     0.4225857
varImpPlot(model)

This is what the random forest says for the biggest variables that impact the for campaign. In terms of direction, look at the correlation matrix from above.

model2 <- randomForest(turnout_score ~ partison_score + support_trump_score + age + support_cannabis_legalization_score + income_over_100k + gender + prob_highest_education_high_school + support_tax_on_wealthy_score + children_in_household, data = bothjoined, importance=TRUE) 
importance(model2)
##                                      %IncMSE IncNodePurity
## partison_score                      18.19551      8138.228
## support_trump_score                 15.48424      2867.535
## age                                 52.80181     13702.113
## support_cannabis_legalization_score 20.42998      4895.203
## income_over_100k                    31.65407     19061.838
## gender                              23.36440      3085.459
## prob_highest_education_high_school  39.71093     22025.812
## support_tax_on_wealthy_score        21.45176     14645.960
## children_in_household               36.02076      5962.028
varImpPlot(model2)

Look at the top 4 variables for impact for voter turnout.

Question 2

ballot_measure_poll %>% group_by(support_initiative) %>% tally()
ggplot(data = ballot_measure_poll, aes(x = region, fill = support_initiative)) + geom_bar(position = "fill") + ggtitle("Region breakdown") + ylab("proportion") 

north region has about 10% more yesses

ggplot(data = ballot_measure_poll, aes(x = education, fill = support_initiative)) + geom_bar(position = "fill") + ggtitle("Education breakdown")  + ylab("proportion") 

post graduate more likely to support

ggplot(data = ballot_measure_poll, aes(x = ses, fill = support_initiative)) + geom_bar(position = "fill")+ ylab("proportion") + ggtitle("Sess breakdown") # the wealthier the more they oppose to the initiave.

ggplot(data = ballot_measure_poll, aes(x = ideology, fill = support_initiative)) + geom_bar(position = "fill")+ ylab("proportion") + ggtitle("ideology breakdown") 

you can see that the initiative is very polarized. Conversatives don’t support while liberals do

ggplot(data = ballot_measure_poll, aes(x = kids, fill = support_initiative)) + geom_bar(position = "fill")+ ylab("proportion") + ggtitle("kids breakdown") 

Kids don’t really affect the initive except for the people that don’t know if they have kids

ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_authoritarianism_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_authoritarianism_score") # you can see a difference between the yes and the no. those with lower score are more likely to say yes
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_taxes_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_taxes_score") # those with a higher tax score more likely to say yes
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_gunrights_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_gunrights_score") # those that have lower score more likely to say yes
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_healthcare_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_healthcare_score") # those more likely to have higher healthcare score more likely to say yes
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_immigrants_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_immigrants_score") # immigration score has an impact
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_supporting_the_poor_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_supporting_the_poor_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), environmentalist_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("environmentalist_score")
## Warning: Removed 57 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), trust_in_institutions_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("trust_in_institutions_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), economic_populist_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("economic_populist_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_military_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_military_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_immigrants_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_immigrants_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_supporting_the_poor_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_supporting_the_poor_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), environmentalist_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("environmentalist_score")
## Warning: Removed 57 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), trust_in_institutions_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("trust_in_institutions_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), economic_populist_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("economic_populist_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_military_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_military_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_regulation_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_regulation_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), traditionalist_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("traditionalist_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), compassionate_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("compassionate_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_free_trade_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_free_trade_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_globalism_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_globalism_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_healthcare_women_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_healthcare_women_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_populism_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_populism_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), presidential_election_turnout_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("presidential_election_turnout_score")
## Warning: Removed 144 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), racial_resentment_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("racial_resentment_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

ggplot(ballot_measure_poll, aes(factor(support_initiative), pro_religious_freedom_score)) + geom_boxplot(aes(fill = factor(support_initiative)))+ ggtitle("pro_religious_freedom_score")
## Warning: Removed 53 rows containing non-finite values (stat_boxplot).

Random forest

model3 <- randomForest(support_initiative  ~ kids + ideology + ethnicity + ses + education + pro_authoritarianism_score + pro_taxes_score +                    pro_gunrights_score + pro_healthcare_score + pro_immigrants_score + pro_supporting_the_poor_score +
                        environmentalist_score + trust_in_institutions_score + economic_populist_score + pro_military_score +                  pro_regulation_score + traditionalist_score + compassionate_score + pro_free_trade_score + pro_globalism_score + pro_healthcare_women_score + pro_populism_score + presidential_election_turnout_score + racial_resentment_score + pro_religious_freedom_score , data = ballot_measure_poll %>% mutate(support_initiative = as.factor(support_initiative)) %>% na.omit(), importance=TRUE) 
importance(model3)
##                                      dont_know          no         yes
## kids                                -1.1194597  0.07733537  0.04364887
## ideology                            -0.1269588 19.93174694 19.77030611
## ethnicity                           -1.2787456 -1.20583668  2.61488550
## ses                                  1.0694766 -0.65736654  1.52352705
## education                            0.7274660 -0.04487890  1.45821377
## pro_authoritarianism_score          -4.4620069  7.53224881  6.27183314
## pro_taxes_score                     -2.0302918  1.81429163  8.60576884
## pro_gunrights_score                 -2.4327574  0.56480653  8.66763527
## pro_healthcare_score                -2.4629323  4.43552881  9.54698186
## pro_immigrants_score                -5.7593275  2.69825398 13.75682067
## pro_supporting_the_poor_score       -2.7508420  4.12645125  6.63006463
## environmentalist_score              -4.2763134 -1.05031702  9.44607546
## trust_in_institutions_score         -1.0010015 -1.41625318 -1.45860804
## economic_populist_score             -1.0328451  1.21196674  6.81716061
## pro_military_score                  -5.4240427  4.53879710  7.23478673
## pro_regulation_score                -3.5670624 -0.09689486  7.24648622
## traditionalist_score                -6.1877662  1.23901842 13.37928226
## compassionate_score                 -0.4480175  4.56035339  7.40479584
## pro_free_trade_score                -2.6183819 -0.72839974  9.41918130
## pro_globalism_score                 -5.0047052 -0.75046600 12.38866988
## pro_healthcare_women_score          -3.5551173  0.12778516  9.65121720
## pro_populism_score                  -3.0239938  2.56743261  7.10037601
## presidential_election_turnout_score -1.2209358  4.75628577 10.99726842
## racial_resentment_score             -3.4365494  0.95923996 13.80944339
## pro_religious_freedom_score         -2.4471525  1.90083969  8.99902604
##                                     MeanDecreaseAccuracy MeanDecreaseGini
## kids                                         -0.09221867        5.1488610
## ideology                                     27.65257816       31.3879340
## ethnicity                                     1.15006375        5.2273769
## ses                                           0.96647836       15.1946352
## education                                     1.27324192       12.4294274
## pro_authoritarianism_score                   11.61420552       22.7781532
## pro_taxes_score                               9.35860983       20.2756725
## pro_gunrights_score                           7.95302187       25.7435747
## pro_healthcare_score                         11.76805568       18.6109935
## pro_immigrants_score                         15.10442023       20.6473256
## pro_supporting_the_poor_score                 9.13858641       23.1518617
## environmentalist_score                        8.30576485        9.2104605
## trust_in_institutions_score                  -1.78703160        0.7279759
## economic_populist_score                       7.72457411       23.7077570
## pro_military_score                            9.09507735       19.4037318
## pro_regulation_score                          6.48754952       17.4306239
## traditionalist_score                         13.03994118       18.3979080
## compassionate_score                          10.60685707       25.4668593
## pro_free_trade_score                          8.48538902       16.4223647
## pro_globalism_score                          10.66151048       18.1453743
## pro_healthcare_women_score                   10.01295504       20.8931286
## pro_populism_score                            7.58373553       19.2750274
## presidential_election_turnout_score          13.26342028       32.1040624
## racial_resentment_score                      15.58860157       21.8430805
## pro_religious_freedom_score                  11.05598220       19.4607859
varImpPlot(model3)

These list the variables and their impact.

● Which score(s) would you recommend that the partners use for doing voter outreach so the ballot measure will be successful?

It seems a good chunk of these have some sort of predictive power. Based off visuals. I would rank them b the ones that the random forest sorted like: Looks like ideology, pro immigrants score,

● How should our partner use this score?

You should use the scores to asssess the liklihood that a candidate will vote and als understand the character profile of this candidate. Based off those characteristics and profile target them

We can create a machine learning classifier to rank the probability that a person will vote yes. We can target those that are 50% or above. You can see how easy it would be to change opinion on certain variables and try weighted targetting based off that

● How would you convey the value of using the targeting strategy you recommend?

I would compare the results that we have now with the strategy that was recomended and see if this would change. The delta I would attribute to the strategy and the targetting efforts.